Natural Value Approximators: Learning when to Trust Past Estimates
Abstract
Neural networks have a smooth initial inductive bias, such that small changes in input do not lead to large changes in output. However, in reinforcement learning domains with sparse rewards, value functions have non-smooth structure with a characteristic asymmetric discontinuity whenever rewards arrive. We propose a mechanism that learns an interpolation between a direct value estimate and a projected value estimate computed from the encountered reward and the previous estimate. This reduces the need to learn about discontinuities, and thus improves the value function approximation. Furthermore, as the interpolation is learned and state-dependent, our method can deal with heterogeneous observability. We demonstrate that this one change leads to significant improvements on multiple Atari games, when applied to the state-of-the-art A3C algorithm.

1 Motivation

The central problem of reinforcement learning is value function approximation: how to accurately estimate the total future reward from a given state. Recent successes have used deep neural networks to approximate the value function, resulting in state-of-the-art performance in a variety of challenging domains [9]. Neural networks are most effective when the desired target function is smooth. However, value functions are, by their very nature, discontinuous functions with sharp variations over time. In this paper we introduce a representation of value that matches the natural temporal structure of value functions.

A value function represents the expected sum of future discounted rewards. If non-zero rewards occur infrequently but reliably, then an accurate prediction of the cumulative discounted reward rises as such rewarding moments approach and drops immediately after. This is depicted schematically with the dashed black line in Figure 1. The true value function is quite smooth, except immediately after receiving a reward, when there is a sharp drop. This is a pervasive scenario because many domains associate positive or negative reinforcements with salient events (like picking up an object, hitting a wall, or reaching a goal position). The problem is that the agent's observations tend to be smooth in time, so learning an accurate value estimate near those sharp drops puts strain on the function approximator, especially when employing differentiable function approximators such as neural networks that naturally make smooth maps from observations to outputs.

Figure 1: After the same amount of training, our proposed method (red) produces much more accurate estimates of the true value function (dashed black), compared to the baseline (blue). The main plot shows discounted future returns as a function of the step in a sequence of states; the inset plot shows the RMSE when training on this data, as a function of network updates. See section 4 for details.

To address this problem, we incorporate the temporal structure of cumulative discounted rewards into the value function itself. The main idea is that, by default, the value function can respect the reward sequence. If no reward is observed, then the next value smoothly matches the previous value, but becomes a little larger due to the discount. If a reward is observed, it should be subtracted out from the previous value: in other words, a reward that was expected has now been consumed.
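To make this intuition concrete, consider a purely illustrative case (not one of the paper's experiments): a single reward of 1 arrives at step $T$, all other rewards are zero, and the discount is a constant $\gamma = 0.9$, as in the example of section 4. In the deterministic case, the true value $k$ steps before the reward is

$$v(S_{T-k}) = \gamma^{k} = 0.9^{k},$$

which rises smoothly towards 1 as the reward approaches and collapses immediately after step $T$, once the reward has been consumed. Along such a sequence the identity $v(S_t) = (v(S_{t-1}) - R_t)/\gamma_t$ holds, so carrying the previous value forward through the observed reward and discount reproduces the sharp drop without the function approximator having to represent it; this is exactly the projection introduced in section 3.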
The natural value approximator (NVA) combines the previous value with the observed rewards and discounts, which makes this sequence of values easy to represent by a smooth function approximator such as a neural network.

Natural value approximators may also be helpful in partially observed environments. Consider a situation in which an agent stands on a hill top. The goal is to predict, at each step, how many steps it will take until the agent has crossed a valley to another hill top in the distance. There is fog in the valley, which means that if the agent's state is a single observation from the valley it will not be able to accurately predict how many steps remain. In contrast, the value estimate from the initial hill top may be much better, because the observation is richer. This case is depicted schematically in Figure 2. Natural value approximators may be effective in these situations, since they represent the current value in terms of previous value estimates.

2 Problem definition

We consider the typical scenario studied in reinforcement learning, in which an agent interacts with an environment at discrete time intervals: at each time step t the agent selects an action as a function of the current state, which results in a transition to the next state and a reward. The goal of the agent is to maximize the discounted sum of rewards collected in the long run from a set of initial states [12]. The interaction between the agent and the environment is modelled as a Markov Decision Process (MDP). An MDP is a tuple $(S, A, R, \gamma, P)$ where $S$ is a state space, $A$ is an action space, $R : S \times A \times S \to D(\mathbb{R})$ is a reward function that defines a distribution over the reals for each combination of state, action, and subsequent state, $P : S \times A \to D(S)$ defines a distribution over subsequent states for each state and action, and $\gamma_t \in [0, 1]$ is a scalar, possibly time-dependent, discount factor. One common goal is to make accurate predictions under a behaviour policy $\pi : S \to D(A)$ of the value

$$v_\pi(s) \equiv \mathbb{E}\left[ R_1 + \gamma_1 R_2 + \gamma_1 \gamma_2 R_3 + \ldots \mid S_0 = s \right]. \tag{1}$$

The expectation is over the random variables $A_t \sim \pi(S_t)$, $S_{t+1} \sim P(S_t, A_t)$, and $R_{t+1} \sim R(S_t, A_t, S_{t+1})$, $\forall t \in \mathbb{N}$. For instance, the agent can repeatedly use these predictions to improve its policy. The values satisfy the recursive Bellman equation [2]

$$v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma_{t+1} v_\pi(S_{t+1}) \mid S_t = s \right].$$

We consider the common setting where the MDP is not known, and so the predictions must be learned from samples. The predictions are made by an approximate value function $v(s; \theta)$, where $\theta$ are parameters that are learned. The approximation of the true value function can be formed by temporal difference (TD) learning [10], where the estimate at time t is updated towards

$$Z_t \equiv R_{t+1} + \gamma_{t+1} v(S_{t+1}; \theta) \quad \text{or} \quad Z_t^n \equiv \sum_{i=1}^{n} \left( \prod_{k=1}^{i-1} \gamma_{t+k} \right) R_{t+i} + \left( \prod_{k=1}^{n} \gamma_{t+k} \right) v(S_{t+n}; \theta), \tag{2}$$

where $Z_t^n$ is the n-step bootstrap target, and the TD-error is $\delta_t^n \equiv Z_t^n - v(S_t; \theta)$.
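To make the n-step target of equation (2) concrete, here is a minimal sketch in Python (ours, not the authors' code; the function name and argument layout are assumptions) that accumulates the discounted rewards and then bootstraps on the final value estimate:

```python
from typing import Sequence

def n_step_target(rewards: Sequence[float],
                  discounts: Sequence[float],
                  bootstrap_value: float) -> float:
    """n-step bootstrap target in the sense of equation (2).

    rewards[i]   corresponds to R_{t+1+i}      for i = 0..n-1
    discounts[i] corresponds to gamma_{t+1+i}  for i = 0..n-1
    bootstrap_value is v(S_{t+n}; theta).
    """
    target = 0.0
    cumulative_discount = 1.0  # product gamma_{t+1} ... gamma_{t+i-1}, empty product for i = 1
    for r, g in zip(rewards, discounts):
        target += cumulative_discount * r
        cumulative_discount *= g
    return target + cumulative_discount * bootstrap_value
```

With n = 1 this reduces to the one-step target $R_{t+1} + \gamma_{t+1} v(S_{t+1}; \theta)$, and the corresponding TD-error is the returned target minus $v(S_t; \theta)$.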
3 Proposed solution: Natural value approximators

The conventional approach to value function approximation produces a value estimate from features associated with the current state. In states where the value approximation is poor, it can be better to rely more on a combination of the observed sequence of rewards and older but more reliable value estimates that are projected forward in time. Combining these estimates can potentially be more accurate than using either one alone.

These ideas lead to an algorithm that produces three estimates of the value at time t. The first estimate, $V_t \equiv v(S_t; \theta)$, is a conventional value function estimate at time t. The second estimate,

$$G_t^p \equiv \frac{G_{t-1}^\beta - R_t}{\gamma_t} \quad \text{if } \gamma_t > 0 \text{ and } t > 0, \tag{3}$$

is a projected value estimate computed from the previous value estimate, the observed reward, and the observed discount for time t. The third estimate,

$$G_t^\beta \equiv \beta_t G_t^p + (1 - \beta_t) V_t = (1 - \beta_t) V_t + \beta_t \frac{G_{t-1}^\beta - R_t}{\gamma_t}, \tag{4}$$

is a convex combination of the first two estimates, formed by a time-dependent blending coefficient $\beta_t$. (Note the mixed recursion in the definitions: $G^p$ depends on $G^\beta$, and vice versa.) This coefficient is a learned function of state, $\beta(\cdot; \theta) : S \to [0, 1]$, over the same parameters $\theta$, and we denote $\beta_t \equiv \beta(S_t; \theta)$. We call $G_t^\beta$ the natural value estimate at time t and we call the overall approach natural value approximators (NVA). Ideally, through training the natural value estimate will become more accurate than either of its constituents.

The value is learned by minimizing the sum of two losses. The first loss captures the difference between the conventional value estimate $V_t$ and the target $Z_t$, weighted by how much it is used in the natural value estimate,

$$J_V \equiv \mathbb{E}\left[ \, [[1 - \beta_t]] \left( [[Z_t]] - V_t \right)^2 \right], \tag{5}$$

where we introduce the stop-gradient identity function $[[x]] = x$ that is defined to have a zero gradient everywhere, that is, gradients are not back-propagated through this function. The second loss captures the difference between the natural value estimate and the target, but it provides gradients only through the coefficient $\beta_t$,

$$J_\beta \equiv \mathbb{E}\left[ \left( [[Z_t]] - \left( \beta_t [[G_t^p]] + (1 - \beta_t) [[V_t]] \right) \right)^2 \right]. \tag{6}$$

These two losses are summed into a joint loss,

$$J = J_V + c_\beta J_\beta, \tag{7}$$

where $c_\beta$ is a scalar trade-off parameter. When conventional stochastic gradient descent is applied to minimize this loss, the parameters of $V_t$ are adapted with the first loss and the parameters of $\beta_t$ are adapted with the second loss.

When bootstrapping on future values, the most accurate value estimate is best, so using $G_t^\beta$ instead of $V_t$ leads to refined prediction targets

$$Z_t^\beta \equiv R_{t+1} + \gamma_{t+1} G_{t+1}^\beta \quad \text{or} \quad Z_t^{\beta, n} \equiv \sum_{i=1}^{n} \left( \prod_{k=1}^{i-1} \gamma_{t+k} \right) R_{t+i} + \left( \prod_{k=1}^{n} \gamma_{t+k} \right) G_{t+n}^\beta. \tag{8}$$
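The following sketch shows one way to read equations (3)-(7) along a single recorded trajectory. It is our own illustration, not the authors' implementation: the function name, the per-trajectory loop, the averaging over time steps in place of the expectation, and the fall-back to $V_t$ when equation (3) does not apply (at $t = 0$ or when $\gamma_t = 0$) are all assumptions. PyTorch's detach() plays the role of the stop-gradient brackets $[[\cdot]]$.

```python
import torch

def natural_value_loss(v, beta, rewards, discounts, targets, c_beta=1.0):
    """Sketch of equations (3)-(7) on one trajectory (illustrative only).

    All arguments are 1-D tensors of length T:
      v[t]        ~ V_t      direct value estimates (require grad)
      beta[t]     ~ beta_t   blending coefficients in [0, 1] (require grad)
      rewards[t]  ~ R_t,  discounts[t] ~ gamma_t
      targets[t]  ~ Z_t      precomputed TD targets
    Returns the joint loss J = J_V + c_beta * J_beta of equation (7),
    averaged over time steps as a stand-in for the expectation.
    """
    T = v.shape[0]
    g_beta_prev = v[0].detach()  # start the recursion from the direct estimate (assumption)
    j_v = torch.zeros(())
    j_beta = torch.zeros(())
    for t in range(T):
        if t == 0 or float(discounts[t]) <= 0.0:
            g_p = v[t].detach()  # eq. (3) does not apply here; fall back to V_t (assumption)
        else:
            g_p = (g_beta_prev - rewards[t]) / discounts[t]      # eq. (3)
        g_beta = beta[t] * g_p + (1.0 - beta[t]) * v[t]          # eq. (4)
        # Eq. (5): move V_t towards the target, weighted by how much it is used;
        # detach() blocks gradients exactly where the [[.]] brackets do.
        j_v = j_v + (1.0 - beta[t]).detach() * (targets[t].detach() - v[t]) ** 2
        # Eq. (6): gradients flow only through beta_t.
        j_beta = j_beta + (targets[t].detach()
                           - (beta[t] * g_p.detach()
                              + (1.0 - beta[t]) * v[t].detach())) ** 2
        g_beta_prev = g_beta.detach()  # treat the recursion as data for the next step
    return (j_v + c_beta * j_beta) / T                           # eq. (7)
```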
4 Illustrative Examples

We now provide some examples of situations where natural value approximations are useful. In both examples, the value function is difficult to estimate well uniformly in all states we might care about, and the accuracy can be improved by using the natural value estimate $G_t^\beta$ instead of the direct value estimate $V_t$.

Sparse rewards

Figure 1 shows an example of value function approximation. To separate concerns, this is a supervised learning setup (regression) with the true value targets provided (dashed black line). Each point $0 \le t \le 100$ on the horizontal axis corresponds to one state $S_t$ in a single sequence. The shape of the target values stems from a handful of reward events, and discounting with $\gamma = 0.9$. We mimic observations that smoothly vary across time by 4 equally spaced radial basis functions, so $S_t \in \mathbb{R}^4$. The approximators $v(s)$ and $\beta(s)$ are two small neural networks with one hidden layer of 32 ReLU units each, and a single linear or sigmoid output unit, respectively. The input to $\beta$ is augmented with the last $k = 16$ rewards. For the baseline experiment, we fix $\beta_t = 0$. The networks are trained for 5000 steps using Adam [5] with minibatch size 32. Because of the small capacity of the v-network, the baseline struggles to make accurate predictions and instead makes systematic errors that smooth over the characteristic peaks and drops in the value function. The natural value estimation obtains ten times lower root mean squared error (RMSE), and it also closely matches the qualitative shape of the target function.

Heterogeneous observability

Our approach is not limited to the sparse-reward setting. Imagine an agent that stands on the top of a hill. By looking into the distance, the agent may be able to predict how many steps it will take to reach the next hill top. When the agent starts descending the hill, it walks into fog in the valley between the hills. There, it can no longer see where it is. However, it could still determine how many steps remain until the next hill top by using the estimate from the first hill and then simply counting steps. This is exactly what the natural value estimate $G_t^\beta$ will give us, assuming $\beta_t = 1$ on all steps in the fog.

Figure 2 illustrates this example, where we assumed each step has a reward of −1 and the discount is one. The best observation-dependent value $v(S_t)$ is shown in dashed blue. In the fog, the agent can do no better than to estimate the average number of steps from a foggy state until the next hill top. In contrast, the true value, shown in red, can be achieved exactly with natural value estimates. Note that in contrast to Figure 1, rewards are dense rather than sparse.

[Figure 2: the two-hills example; value estimates plotted against the step index (horizontal axis: step, 0 to 100).]

In both examples, we can sometimes trust past value functions more than current estimates, either because of function approximation error, as in the first example, or because of partial observability.
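As a final numerical check of the fog example (again illustrative; the function name and the starting value are made up), setting $\beta_t = 1$, $R_t = -1$ and $\gamma_t = 1$ in equations (3)-(4) makes the natural estimate count up by one each step, exactly tracking the number of remaining steps:

```python
def natural_estimates_in_fog(v_first_hill: float, num_foggy_steps: int):
    """Replays equations (3)-(4) with beta_t = 1, R_t = -1 and gamma_t = 1.

    v_first_hill is the (accurate) value predicted before entering the fog,
    e.g. -10.0 if ten steps remain. Because beta_t = 1, the direct estimate
    V_t from the foggy observations never enters the natural estimate.
    """
    g_beta = v_first_hill
    estimates = []
    for _ in range(num_foggy_steps):
        g_p = (g_beta - (-1.0)) / 1.0  # eq. (3): subtract the observed reward, divide by the discount
        g_beta = g_p                   # eq. (4) with beta_t = 1: the (1 - beta_t) V_t term vanishes
        estimates.append(g_beta)
    return estimates

# natural_estimates_in_fog(-10.0, 3) -> [-9.0, -8.0, -7.0]
```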